Machine Learning - Guided Optimization (MLGO)¶
Introduction¶
MLGO refers to integrating ML techniques (primarily) to replace heuristics within LLVM with machine learned models.
Currently the following heuristics feature such integration:
Inlining for size
Register allocation (LLVM greedy eviction heuristic) for performance
This document is an outline of the tooling and APIs facilitating MLGO.
Note that tools for orchestrating ML training are not part of LLVM, as they are dependency-heavy - both on the ML infrastructure choice, as well as choices of distributed computing. For the training scenario, LLVM only contains facilities enabling it, such as corpus extraction, training data extraction, and evaluation of models during training.
Corpus Tooling¶
Within the LLVM monorepo, there is the mlgo-utils python package, which lives
at llvm/utils/mlgo-utils. This package primarily contains tooling
for working with corpora, or collections of LLVM bitcode. We use these corpora
to train and evaluate ML models. Corpora consist of a description in JSON
format at corpus_description.json
in the root of the corpus, and then
a bitcode file and command line flags file for each extracted module. The
corpus structure is designed to contain sufficient information to fully
compile the bitcode to bit-identical object files.
Synopsis¶
The extract_ir.py tool extracts a corpus from some form of a structured compilation database. It supports a variety of different scenarios and input types.
Options¶
- --input¶
The path to the input. This should be a path to a supported structured compilation database. Currently only
compile_commands.json
files, linker parameter files, a directory containing object files (for the local ThinLTO case only), or a JSON file containing a bazel aquery result are supported.
- --input_type¶
The type of input that has been passed to the
--input
flag.
- --output_dir¶
The output directory to place the corpus in.
- --num_workers¶
The number of workers to use for extracting bitcode into the corpus. This defaults to the number of hardware threads available on the host system.
- --llvm_objcopy_path¶
The path to the llvm-objcopy binary to use when extracting bitcode.
- --obj_base_dir¶
The base directory for object files. Bitcode files that get extracted into the corpus will be placed into the output directory based on where their source object files are placed relative to this path.
- --cmd_filter¶
Allows filtering of modules by command line. If set, only modules whose command lines match the filter will be extracted into the corpus. Regular expressions are supported in some instances.
- --thinlto_build¶
If the build was performed with ThinLTO, this should be set to either distributed or local depending upon how the build was performed.
- --cmd_section_name¶
This flag allows specifying the command line section name. This is needed on non-ELF platforms where the section name might differ.
- --bitcode_section_name¶
This flag allows specifying the bitcode section name. This is needed on non-ELF platforms where the section name might differ.
Example: CMake¶
CMake can output a compile_commands.json
compilation database if the
CMAKE_EXPORT_COMPILE_COMMANDS
switch is turned on at compile time. It is
also necessary to enable bitcode embedding (done by passing
-Xclang -fembed-bitcode=all
to all C/C++ compilation actions in the
non-ThinLTO case). For example, to extract a corpus from clang, you would
run the following commands (assuming that the system C/C++ compiler is clang):
cmake -GNinja \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_EXPORT_COMPILE_COMMANDS=ON \
-DCMAKE_C_FLAGS="-Xclang -fembed-bitcode=all" \
-DCMAKE_CXX_FLAGS="-Xclang -fembed-bitcode=all" \
../llvm
ninja
After running CMake and building the project, there should be a
compile_commands.json
file within the build directory. You can then run the following command to create a corpus:
python3 ./extract_ir.py \
--input=./build/compile_commands.json \
--input_type=json \
--output_dir=./corpus
After running the above command, there should be a full
corpus of bitcode within the ./corpus
directory.
Example: Bazel Aquery¶
This tool also supports extracting bitcode from bazel in multiple ways
depending upon the exact configuration. For ThinLTO, a linker parameters file
is preferred. For the non-ThinLTO case, the script will accept the output of
bazel aquery
which it will use to find all the object files that are linked
into a specific target and then extract bitcode from them. First, you need
to generate the aquery output:
bazel aquery --output=jsonproto //path/to:target > /path/to/aquery.json
Afterwards, assuming that the build is already complete, you can run this script to create a corpus:
python3 ./extract_ir.py \
--input=/path/to/aquery.json \
--input_type=bazel_aquery \
--output_dir=./corpus \
--obj_base_dir=./bazel-bin
This will again leave a corpus that contains all the bitcode files. This mode
does not capture all object files in the build however, only the ones that
are involved in the link for the binary passed to the bazel aquery
invocation.
Synopsis¶
The make_corpus.py tool creates a corpus from a collection of bitcode files.
Options¶
- --input_dir¶
The input directory to search for bitcode files in.
- --output_dir¶
The output directory to place the constructed corpus in.
- --default_args¶
A list of space separated flags that are put into the corpus description. These are used by some tooling when compiling the modules within the corpus.
Synopsis¶
The combine_training_corpus.py tool combines two training corpora that share the same parent folder by generating
a new corpus_description.json
that contains all the modules in both corpora.
Options¶
- --root_dir¶
The root directory that contains subfolders consisting of the corpora that should be combined.
Interacting with ML models¶
We interact with ML models in two primary scenarios: one is to train such a model; the other, inference, is to use a model during compilation to make optimization decisions.
For a specific optimization problem - i.e. inlining, or regalloc eviction - we first separate correctness-preserving decisions from optimization decisions. For example, not inlining functions marked “no inline” is an example of the former; so is not evicting an unevictable live range. An example of the latter is deciding to inline a function that will bloat the caller size, just because we have reason to believe that later, the effect will be some constant propagation that will actually reduce the size (or dynamic instruction count).
ML models can be understood as functions. Their inputs are tensors - buffers of scalars. The output (in our case, singular) is a scalar. For example, for inlining, the inputs are properties of the caller, callee, and the callsite being analyzed for inlining. The output is a boolean.
Inputs and outputs are named, have a scalar type (e.g. int32_t) and a shape (e.g. 3x4). These are the elements that we use to bind to a ML model.
In both training and inference, we want to expose to ML (training algorithms or trained model, respectively) the features we want to make optimization decisions on. In that regard, the interface from the compiler side to the ML side is the same: pass features, and get a decision. It’s essentially a function call, where the parameters and result are bound by name and are described by name, scalar type, and shape tuples.
The main types in LLVM are:
- MLModelRunner - an abstraction for the decision making mechanism
- TensorSpec - which describes a tensor.
TensorSpec¶
See llvm/Analysis/TensorSpec.h
. This is a simple data bag, identifying a
tensor by name (a string), scalar type, and shape (a vector of ints). The scalar
type can only be int (8, 16, 32, or 64), signed or unsigned; float; or double.
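For orientation, here is a minimal sketch describing two inputs this way; the names and shapes are invented for the example and are not features of any actual MLGO policy:
#include "llvm/Analysis/TensorSpec.h"
#include <vector>

using namespace llvm;

// Hypothetical feature list: a scalar int64 feature and a 3x4 float feature.
std::vector<TensorSpec> makeExampleFeatures() {
  return {
      TensorSpec::createSpec<int64_t>("some_count", {1}),
      TensorSpec::createSpec<float>("some_matrix", {3, 4}),
  };
}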
MLModelRunner¶
See llvm/Analysis/MLModelRunner.h
. The abstraction has a pure virtual,
evaluateUntyped
, but the contract with implementers is a bit more involved:
Implementers¶
At construction, the implementer is expected to receive a list of TensorSpec
for input features and the TensorSpec
of the output (e.g.
std::vector<TensorSpec>
). The list type is not contractual, but it must be
a 0-based indexing array-like container. Given a TensorSpec
at index “I” in
the input list, that has a name “N”, shape “D1 x D2 x … x Dn”, and scalar type
“T”, the implementer must:
- set up a contiguous buffer sized sizeof(T) * D1 * D2 * ... * Dn. This buffer’s lifetime must be the same as the lifetime of the implementer object.
- call MLModelRunner::setUpBufferForTensor, passing I, the TensorSpec, and the buffer above.
Internally, the expectation is that the implementer uses the name (and maybe
shape) of a TensorSpec
for binding (e.g. lookup in an underlying ML model).
MLModelRunner::setUpBufferForTensor
stores each buffer at the corresponding
index (i.e. its position in the list used at construction). The expectation is
that the user will use that position when calling MLModelRunner::getTensor
to retrieve the underlying buffer (more on that in a bit).
The implementation of evaluateUntyped
is expected to use the value in the
buffers described above, carry out whatever computation (e.g. evaluate a ML
model) and then place the outcome in an output buffer which will be returned to
the caller. Importantly, evaluateUntyped
must not reset the input buffers.
This is because during training we may want to log the features and decisions,
and since the data is already buffered, there’s no reason to force backing it
up elsewhere.
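Putting the contract together, a minimal implementer sketch could look as follows. The base class constructor arguments (context, kind, number of inputs) and the Kind value used here are assumptions for the sketch; check llvm/Analysis/MLModelRunner.h for the exact signatures.
#include "llvm/Analysis/MLModelRunner.h"
#include "llvm/Analysis/TensorSpec.h"
#include <vector>

using namespace llvm;

class MyModelRunner : public MLModelRunner {
public:
  MyModelRunner(LLVMContext &Ctx, const std::vector<TensorSpec> &Inputs,
                const TensorSpec &Output)
      : MLModelRunner(Ctx, MLModelRunner::Kind::Development, Inputs.size()),
        OutputBuffer(Output.getTotalTensorBufferSize()) {
    InputStorage.reserve(Inputs.size()); // keep buffer addresses stable
    for (size_t I = 0; I < Inputs.size(); ++I) {
      // Own a contiguous buffer for feature I; its lifetime matches *this.
      InputStorage.emplace_back(Inputs[I].getTotalTensorBufferSize());
      setUpBufferForTensor(I, Inputs[I], InputStorage.back().data());
    }
  }

private:
  void *evaluateUntyped() override {
    // Evaluate the underlying model using the input buffers (not shown) and
    // write the decision into OutputBuffer; the input buffers are left as-is.
    return OutputBuffer.data();
  }

  std::vector<std::vector<char>> InputStorage;
  std::vector<char> OutputBuffer;
};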
Users¶
The users must pass the input TensorSpec
list at the construction of a
specific MLModelRunner
object. After that, users can be agnostic of the
specific implementation, and would typically follow this workflow (see the sketch after the list):
- call getTensor or getTensorUntyped for each input tensor, identified by its index (i.e. the index of the corresponding TensorSpec in the list used at construction).
- populate the tensor buffer of each input tensor with values. Users can take advantage of the stability of the tensor buffers, e.g. set only once those that don’t change, or cache the buffer address.
- call evaluate and use its result.
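A minimal user-side sketch, assuming the runner was constructed with an int64 feature at index 0 and a boolean decision (both assumptions for the example; they must match the TensorSpec list used at construction):
#include "llvm/Analysis/MLModelRunner.h"
#include <cstdint>

using namespace llvm;

bool shouldTransform(MLModelRunner &Runner, int64_t SomeFeatureValue) {
  *Runner.getTensor<int64_t>(0) = SomeFeatureValue; // populate input 0
  return Runner.evaluate<bool>();                   // run the model, get the decision
}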
Versioning¶
We support a model “knowing” fewer inputs than the compiler. This is supported by
MLModelRunner::setUpBufferForTensor
. If a TensorSpec
requested by the
compiler is not supported by the underlying model, the MLModelRunner
implementer must still call setUpBufferForTensor
with a nullptr
value
for the buffer. In turn, MLModelRunner
will allocate an appropriately-sized
buffer and track its lifetime. The user can safely populate that buffer. Since
the rest of the inputs are still provided, this allows an evolution model where
we first add features to the compiler and continue using older models without
regressing. Then, the new compiler can be used to train new models. Deprecating
features in the compiler involves, then, training first a model without those
features.
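In terms of the implementer sketch above, this could look like the following fragment of the constructor loop; isKnownToModel is a hypothetical helper querying the underlying model:
// Inside the implementer's constructor.
for (size_t I = 0; I < Inputs.size(); ++I) {
  if (isKnownToModel(Inputs[I].name())) {
    InputStorage.emplace_back(Inputs[I].getTotalTensorBufferSize());
    setUpBufferForTensor(I, Inputs[I], InputStorage.back().data());
  } else {
    // Not known to the model: pass nullptr; MLModelRunner allocates and
    // owns an appropriately-sized buffer the user can still populate.
    setUpBufferForTensor(I, Inputs[I], nullptr);
  }
}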
MLModelRunner implementations¶
We currently feature 4 implementations:
- ModelUnderTrainingRunner. This requires the compiler be built with TFLite support. It allows loading a TFLite model dynamically and is primarily intended for training scenarios, but it can be used relatively easily in production build environments, as it does not change how the compiler operates (why this remark is necessary will become clear in a few paragraphs).
- ReleaseModeModelRunner. This is intended for inference scenarios. It uses the rules defined in llvm/cmake/modules/TensorFlowCompile.cmake to convert, at the time the compiler is built, TensorFlow Saved Models into a header (.h) and native object (.o). The latter is a CPU-based implementation of the neural network, together with its weights (essentially, loops performing matrix multiplications).
  NOTE: we are actively working on replacing this with an EmitC implementation requiring no out of tree build-time dependencies.
- InteractiveModelRunner. This is intended for training scenarios where the training algorithm drives compilation. This model runner has no special dependencies, and relies on I/O pipes to communicate with a separate process, presumably a python training algorithm. We do not envision using this in a production environment.
- NoInferenceModelRunner. This serves as a store for feature values, and its evaluate should never be called. It’s used for training scenarios, when we want to capture the behavior of the default (non-ML) heuristic.
Note that training leaves it to the training infrastructure to handle distributed computing. The assumed architecture has python processes communicating remotely between themselves, but managing local communication with clang.
Logging Facility¶
When training models, we need to expose the features we will want to use during inference, as well as outcomes, to guide reward-based learning techniques. This can happen in 2 forms:
- when running the compiler on some input, as a capture of the features and actions taken by some policy or a model currently being used. For example, see DevelopmentModeInlineAdvisor or DevelopmentModeEvictAdvisor in MLRegallocEvictAdvisor.cpp. In more detail, in the former case, if -training-log is specified, the features and actions (inline/no inline) from each inlining decision are saved to the specified file. Since MLModelRunner implementations hold on to feature values (they don’t get cleared by evaluate), logging is easily supported by just looping over the model runner’s features and passing the tensor buffers to the logger. Note how we use the NoInferenceModelRunner to capture the features observed when using the default policy.
- as a serialization mechanism for the InteractiveModelRunner. Here, we need to pass the observed features over IPC (a file descriptor, likely a named pipe).
Both cases require serializing the same kind of data and we support both with
Analysis/Utils/TrainingLogger
.
The goal of the logger design was avoiding any new dependency, and optimizing for the tensor scenario - i.e. exchanging potentially large buffers of fixed size, containing scalars. We explicitly assume the reader of the format has the same endianness as the compiler host, and we further expect the reader and the compiler run on the same host. This is because we expect the training scenarios have a (typically python) process managing the compiler process, and we leave to the training side to handle remoting.
The logger produces the following sequence:
- a header describing the structure of the log. This is a one-line textual JSON dictionary with the following elements:
  - features: a list of JSON-serialized TensorSpec values. The position in the list matters, as it will be the order in which values will be subsequently recorded. If we are just logging (i.e. not using the InteractiveModelRunner), the last feature should be that of the action (e.g. “inline/no inline”, or “index of evicted live range”).
  - (optional) score: a TensorSpec describing a value we will include to help formulate a reward. This could be a size estimate or a latency estimate.
  - (optional) advice: a TensorSpec describing the action. This is used for the InteractiveModelRunner, in which case it shouldn’t be in the features list.
- a sequence of contexts. Contexts are independent traces of the optimization problem. For module passes, there is only one context; for function passes, there is a context per function. The start of a context is marked with a one-line JSON dictionary of the form {"context": <context name, a string>}
  Each context has a sequence of observations. An observation consists of:
  - a one-line JSON {"observation": <observation number, 0-indexed>}
  - a binary dump of the tensor buffers, in the order in which they were specified in the header
  - a new line character
  - if score was specified in the header:
    - a one-line JSON object {"outcome": <value>}, where the value conforms to the TensorSpec defined for the score in the header
    - the outcome value, as a binary dump
    - a new line character.
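For orientation only, the following sketch writes a log with one int64 feature of shape [2], one context, and one observation by hand. The actual writer is Analysis/Utils/TrainingLogger; the TensorSpec JSON keys shown in the header string are an approximation of its serialization, so treat them as an assumption and defer to log_reader.py for the exact schema.
#include <cstdint>
#include <fstream>

int main() {
  std::ofstream Log("demo.log", std::ios::binary);
  // Header: one int64 feature of shape [2], no score.
  Log << R"({"features": [{"name": "x", "port": 0, "type": "int64_t", "shape": [2]}]})"
      << "\n";
  // One context, with a single observation.
  Log << R"({"context": "some_function"})" << "\n";
  Log << R"({"observation": 0})" << "\n";
  int64_t Tensor[2] = {1, 2};
  Log.write(reinterpret_cast<const char *>(Tensor), sizeof(Tensor)); // raw dump
  Log << "\n"; // newline terminator after the tensor dump
  return 0;
}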
The format uses a mix of textual JSON (for headers) and binary dumps (for tensors) because the headers are not expected to dominate the payload - the tensor values are. We wanted to avoid overburdening the log reader - likely python - with additional dependencies; and the one-line JSON makes it rudimentarily possible to inspect a log without additional tooling.
A python utility for reading logs, used for tests, is available at
Analysis/models/log_reader.py
. A utility showcasing the InteractiveModelRunner
,
which uses this reader as well, is at Analysis/models/interactive_host.py
.
The latter is also used in tests.
There is no C++ implementation of a log reader. We do not have a scenario motivating one.
IR2Vec Embeddings¶
IR2Vec is a program embedding approach designed specifically for LLVM IR. It is implemented as a function analysis pass in LLVM. The IR2Vec embeddings capture syntactic, semantic, and structural properties of the IR through learned representations. These representations are obtained as a JSON vocabulary that maps the entities of the IR (opcodes, types, operands) to n-dimensional floating point vectors (embeddings).
With IR2Vec, representation at different granularities of IR, such as instructions, functions, and basic blocks, can be obtained. Representations of loops and regions can be derived from these representations, which can be useful in different scenarios. The representations can be useful for various downstream tasks, including ML-guided compiler optimizations.
The core components are:
- Vocabulary: A mapping from IR entities (opcodes, types, etc.) to their vector representations. This is managed by IR2VecVocabAnalysis.
- Embedder: A class (ir2vec::Embedder) that uses the vocabulary to compute embeddings for instructions, basic blocks, and functions.
Using IR2Vec¶
For generating embeddings, first the vocabulary should be obtained. Then, the
embeddings can be computed and accessed via an ir2vec::Embedder
instance.
Get the Vocabulary: In a ModulePass, get the vocabulary analysis result:
auto &VocabRes = MAM.getResult<IR2VecVocabAnalysis>(M);
if (!VocabRes.isValid()) {
  // Handle error: vocabulary is not available or invalid
  return;
}
const ir2vec::Vocab &Vocabulary = VocabRes.getVocabulary();
Note that the IR2VecVocabAnalysis pass is immutable.
Create Embedder instance: With the vocabulary, create an embedder for a specific function:
// Assuming F is an llvm::Function&
// For example, using IR2VecKind::Symbolic:
Expected<std::unique_ptr<ir2vec::Embedder>> EmbOrErr =
    ir2vec::Embedder::create(IR2VecKind::Symbolic, F, Vocabulary);
if (auto Err = EmbOrErr.takeError()) {
  // Handle error in embedder creation
  return;
}
std::unique_ptr<ir2vec::Embedder> Emb = std::move(*EmbOrErr);
Compute and Access Embeddings: Call getFunctionVector() to get the embedding for the function.
const ir2vec::Embedding &FuncVector = Emb->getFunctionVector();
Currently, Embedder can generate embeddings at three levels: Instructions, Basic Blocks, and Functions. Appropriate getters are provided to access the embeddings at these levels.
Note: The validity of an Embedder instance (and the embeddings it generates) is tied to the function it is associated with; it holds only as long as that function remains unchanged. If the function is modified, the embeddings may become stale and should be recomputed accordingly.
Working with Embeddings: Embeddings are represented as std::vector<double>. These vectors can be used as features for machine learning models, to compute similarity scores between different code snippets, or to perform other analyses as needed.
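As a sketch of the similarity use mentioned above, a plain cosine similarity over the double vectors (generic code, not an LLVM API):
#include <algorithm>
#include <cmath>
#include <vector>

// Cosine similarity between two embeddings, treated as plain vectors of
// doubles as described above. Returns 0 for empty or zero-norm inputs.
double cosineSimilarity(const std::vector<double> &A,
                        const std::vector<double> &B) {
  double Dot = 0.0, NormA = 0.0, NormB = 0.0;
  size_t N = std::min(A.size(), B.size());
  for (size_t I = 0; I < N; ++I) {
    Dot += A[I] * B[I];
    NormA += A[I] * A[I];
    NormB += B[I] * B[I];
  }
  if (NormA == 0.0 || NormB == 0.0)
    return 0.0;
  return Dot / (std::sqrt(NormA) * std::sqrt(NormB));
}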
Further Details¶
For more detailed information about the IR2Vec algorithm, its parameters, and
advanced usage, please refer to the original paper:
IR2Vec: LLVM IR Based Scalable Program Embeddings.
The LLVM source code for IR2Vec
can also be explored to understand the
implementation details.
Building with ML support¶
NOTE: For up-to-date information on custom builds, see the ml-* build bots and how they are set up.
Embed pre-trained models (aka “release” mode)¶
This supports the ReleaseModeModelRunner
model runners.
You need a tensorflow pip package for the AOT (ahead-of-time) Saved Model compiler
and a thin wrapper for the native function generated by it. We currently support
TF 2.15. We recommend using a python virtual env (in which case, remember to
pass -DPython3_ROOT_DIR
to cmake
).
Once you install the pip package, find where it was installed:
TF_PIP=$(sudo -u buildbot python3 -c "import tensorflow as tf; import os; print(os.path.dirname(tf.__file__))")
Then build LLVM:
cmake -DTENSORFLOW_AOT_PATH=$TF_PIP \
-DLLVM_INLINER_MODEL_PATH=<path to inliner saved model dir> \
-DLLVM_RAEVICT_MODEL_PATH=<path to regalloc eviction saved model dir> \
<...other options...>
The example shows the flags for both inlining and regalloc, but either may be omitted.
You can also specify a URL for the path, and it is also possible to pre-compile
the header and object and then just point to the precompiled artifacts. See for
example LLVM_OVERRIDE_MODEL_HEADER_INLINERSIZEMODEL
.
Note that we are transitioning away from the AOT compiler shipping with the tensorflow package, and to an EmitC, in-tree solution, so these details will change soon.
Using TFLite (aka “development” mode)¶
This supports the ModelUnderTrainingRunner
model runners.
Build the TFLite package using this script.
Then, assuming you ran that script in /tmp/tflitebuild
, just pass
-C /tmp/tflitebuild/tflite.cmake
to the cmake
for LLVM.
Interactive Mode (for training / research)¶
The InteractiveModelRunner
is available with no extra dependencies. For the
optimizations that are currently MLGO-enabled, it may be used as follows:
for inlining:
-mllvm -enable-ml-inliner=release -mllvm -inliner-interactive-channel-base=<name>
for regalloc eviction:
-mllvm -regalloc-evict-advisor=release -mllvm -regalloc-evict-interactive-channel-base=<name>
where the name
is a path fragment. We will expect to find 2 files,
<name>.in
(readable, data incoming from the managing process) and
<name>.out
(writable, the model runner sends data to the managing process)